Temporal search in web archives

Author

  • Klaus Berberich
Abstract

Web archives include both archives of contents originally published on the Web (e.g., the Internet Archive) and archives of contents published long ago that are now accessible on the Web (e.g., the archive of The Times). Thanks to the increased awareness that web-born contents are worth preserving and to improved digitization techniques, web archives have grown in number and size. To unfold their full potential, search techniques are needed that consider their inherent special characteristics. This work addresses three important problems toward this objective and makes the following contributions:

  • We present the Time-Travel Inverted indeX (TTIX) as an efficient solution to time-travel text search in web archives, allowing users to search only the parts of the web archive that existed at a user’s time of interest.
  • To counter negative effects that terminology evolution has on the quality of search results in web archives, we propose a novel query-reformulation technique, so that old but highly relevant documents are retrieved in response to today’s queries.
  • For temporal information needs, for which the user is best satisfied by documents that refer to particular times, we describe a retrieval model that integrates temporal expressions (e.g., “in the 1990s”) seamlessly into a language modeling approach.

Experiments for each of the proposed methods show their efficiency and effectiveness, respectively, and demonstrate the viability of our approach to search in web archives.
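To make the idea of time-travel text search concrete, the following is a minimal sketch and not the TTIX design itself, whose internals the abstract does not describe. It assumes that every document version carries a validity interval and that versions of the same document do not overlap in time. The names `Posting`, `TimeTravelIndex`, `add_version`, and `search` are hypothetical and chosen only for illustration, as are the integer time points.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One index entry: a document version and the interval in which it existed."""
    doc_id: str
    begin: int  # first time point at which the version existed (inclusive)
    end: int    # time point at which it was superseded or removed (exclusive)


class TimeTravelIndex:
    """Illustrative inverted index whose postings carry validity intervals."""

    def __init__(self):
        self._postings = defaultdict(list)  # term -> list of Posting

    def add_version(self, doc_id, text, begin, end):
        # Index each distinct term of this version once.
        for term in set(text.lower().split()):
            self._postings[term].append(Posting(doc_id, begin, end))

    def search(self, query, at):
        """Return ids of documents that contained all query terms at time `at`."""
        result = None
        for term in query.lower().split():
            alive = {
                p.doc_id
                for p in self._postings.get(term, [])
                if p.begin <= at < p.end
            }
            result = alive if result is None else result & alive
        return result or set()


# Hypothetical toy archive: document "a" has two versions, "b" has one.
index = TimeTravelIndex()
index.add_version("a", "web archive search", 2000, 2005)
index.add_version("a", "web archive", 2005, 2010)
index.add_version("b", "archive search engine", 2003, 2010)

print(sorted(index.search("archive search", at=2004)))  # ['a', 'b']
print(sorted(index.search("archive search", at=2006)))  # ['b']
```

The sketch scans every posting of a term and filters by time, which is simple but does nothing for efficiency; an efficient solution would need to avoid reading postings that are not valid at the time of interest.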

Similar resources

Towards Information Retrieval Evaluation over Web Archives

We present the first overview of a web archive user profile and the searching technology that supports it. Most web archives only support URL search and just a few provide full-text search in response to users’ expectations. Their technology is essentially based on web search engines, which ignore the temporal dimension of collections. As a consequence, the quality of results is poor. We suggest t...

Search and Access Strategies for Web Archives

The Web has become the main publication medium worldwide, covering almost every facet of human activity. In many cases, the Web is the only medium where such information is recorded. However, the Web is an ephemeral medium whose contents are constantly changing and new information is rapidly replacing old information, and hence the critical importance of establishing web archives to capture at ...

Tempas: Temporal Archive Search Based on Tags

Limited search and access patterns over Web archives have been well documented. One of the key reasons is the lack of understanding of the user access patterns over such collections, which in turn is attributed to the lack of effective search interfaces. Current search interfaces for Web archives are (a) either purely navigational or (b) have sub-optimal search experience due to ineffective ret...

Evaluating Web Archive Search Systems

The information published on the web, a representation of our collective memory, is rapidly vanishing. At least 77 web archives have been developed to cope with the web’s transience problem, but despite their technology having achieved a good maturity level, the retrieval effectiveness of the search services they provide still presents unsatisfactory results. In this work, we propose an evaluat...

Characterizing Search Behavior in Web Archives

Web archives are a huge source of information to mine the past. However, tools to explore web archives are still in their infancy, in part due to the reduced knowledge that we have of their users. We contribute to this knowledge by presenting the first search behavior characterization of web archive users. We obtained detailed statistics about the users’ sessions, queries, terms and clicks from...

Publication date: 2010